Neural network language model
Types
* LSTM
* GRU

Difficulties

TODO: Chen et al. (2015) survey existing techniques and "further extend self normalization to be a proper estimator of likelihood and introduce an efficient variant of softmax".

TODO: a totally different approach, pointing/copying: Gulcehre et al. (2016) [Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou and Yoshua Bengio. 2016. Pointing the Unknown Words. ACL.]

TODO: what about score matching? https://www.cs.helsinki.fi/u/ahyvarin/papers/JMLR05.pdf

Grave et al. (2016) [Édouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, Hervé Jégou. Efficient Softmax Approximation for GPUs. https://arxiv.org/pdf/1609.04309v2.pdf] distinguish "(i) the methods that consider the original distribution and aim at providing approximations of the probabilities, or of a subset of them (Bengio et al., 2003b; Ji et al., 2015), from (ii) the approaches that compute exact probabilities for an approximate model yielding a lower computational cost, such as the popular hierarchical softmax (Goodman, 2001a; Mnih & Hinton, 2009; Morin & Bengio, 2005)."

Grave et al. (2016): adaptive softmax

Various solutions have been proposed to circumvent the cost of normalizing over a large output vocabulary. Some of them are variants of the original class decomposition idea (Bengio, 2002):
# importance sampling (Bengio et al., 2003);
# uniform sampling of ranking criterion (Collobert et al., 2011);
# hierarchical softmax (Morin et al., 2005);
# hierarchical log-bilinear model (Mnih et al., 2009);
# structured output layer (Le et al., 2011).
Others avoid computing the probability distribution altogether, using noise-contrastive estimation (Mnih et al., 2012) or negative sampling, or use an alternative loss (Vincent et al., 2015).

Slow training time

TODO: Bloom embeddings (Serrà and Karatzoglou, 2017: Serrà, J., & Karatzoglou, A. (2017). Getting deep recommenders fit: Bloom embeddings for sparse binary input/output networks.
https://doi.org/10.1145/3109859.3109876)

Solution: Approximate Softmax (AKA sampled softmax)

Jean et al. (2014) [Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2014. "On Using Very Large Target Vocabulary for Neural Machine Translation."]: "In practice, hence, we partition the training corpus and define a subset V′ of the target vocabulary for each partition prior to training. Before training begins, we sequentially examine each target sentence in the training corpus and accumulate unique target words until the number of unique target words reaches the predefined threshold τ. The accumulated vocabulary will be used for this partition of the corpus during training."

Solution: Short-list

An output layer with tens to hundreds of thousands of units has proved too computationally expensive. A simple solution is to use a short-list of only a few thousand words. Originally, Bengio (2003) [Y. Bengio, R. Ducharme, P. Vincent. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137-1155, 2003.] merged all infrequent words into a special class UKN (shorthand for "Unknown") that represents the probability of all rare words. The probability of each rare word within the class is estimated from its unigram frequency. This approach was later improved by Schwenk and Gauvain (2005) [H. Schwenk, J. Gauvain. Training Neural Network Language Models On Very Large Corpora. In Proceedings of Joint Conference HLT/EMNLP, 2005.], who redistributed the probabilities of rare words using an n-gram model.

Note that vocabulary truncation techniques can provide very significant speed-ups, but at a noticeable cost in accuracy. In some cases Schwenk used as few as 2K output units in the neural network, and even though these correspond to the most frequent words, the performance degradation was significant, as was later shown in Le et al. (2011): H.-S. Le, I. Oparin, A. Allauzen, J.-L. Gauvain, F. Yvon. Structured Output Layer Neural Network Language Model.
In Proc. of ICASSP, 2011.

The influence of the vocabulary size on the performance of RNNs seems to be an as yet untouched research topic.

Solution: Hierarchical architecture

Solution: SOUL architecture

The vocabulary is divided into two parts:
# Short-listed words, whose probability is computed directly.
# Other words, which are further divided into classes and subclasses. Their probability is computed by the following formula:
#: P(w_i|h) = P(c_1(w_i)|h) \prod_{d=2}^{D} P(c_d(w_i)|h, c_{1:d-1})

Classes are induced by repeatedly clustering words. More specifically, Le et al. (2011) used a 3-step procedure:
# Train a standard NNLM with the short-list as its output, following the one-vector initialization scheme (Le et al., 2010) [H.S. Le, A. Allauzen, G. Wisniewski, and F. Yvon, “Training continuous space language models: Some practical issues,” in Proc. of EMNLP’10, Cambridge, MA, 2010, pp. 778–788.].
# Reduce the dimension of the context space using principal component analysis (10 dimensions).
# Perform recursive K-means word clustering based on the distributed representation induced by the context space (except for words in the short-list).

Solution: Noise-contrastive estimation

* Mnih, A., & Teh, Y. W. (2012). A fast and simple algorithm for training neural probabilistic language models. arXiv preprint arXiv:1206.6426.
* Vaswani, A., Zhao, Y., Fossum, V., & Chiang, D. (2013). Decoding with Large-Scale Neural Language Models Improves Translation. In EMNLP (pp. 1387-1392).
* Mnih, A., & Kavukcuoglu, K. (2013). Learning word embeddings efficiently with noise-contrastive estimation. In Advances in Neural Information Processing Systems (pp. 2265-2273).

Pre-compute hidden layer

Slow decoding time

TODO: good results: https://arxiv.org/abs/1902.11269

TODO: filter output vocabulary based on co-occurrence with source words (EACL tutorial on neural MT).

Solutions to slow training time listed above also help reduce decoding time.
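To make the last point concrete, here is a minimal sketch of a two-level class factorization, P(w|h) = P(c(w)|h) · P(w|h, c(w)), which is the simplest case of the class-based decompositions listed above. It is not the SOUL implementation of Le et al. (2011): the class name `TwoLevelSoftmax`, the random weights and the hand-written class assignment are all assumptions made for illustration. The point it shows is that scoring one word touches only the class scores plus the scores of the words in that one class, rather than the whole vocabulary.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array of scores."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


class TwoLevelSoftmax:
    """Hypothetical two-level output layer: P(w|h) = P(c(w)|h) * P(w|h, c(w))."""

    def __init__(self, word_to_class, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.word_to_class = np.asarray(word_to_class)
        n_classes = int(self.word_to_class.max()) + 1
        # members[c] holds the ids of the words assigned to class c.
        self.members = [np.flatnonzero(self.word_to_class == c)
                        for c in range(n_classes)]
        # Randomly initialised weights stand in for trained parameters.
        self.class_weights = rng.normal(scale=0.1, size=(n_classes, hidden_dim))
        self.word_weights = rng.normal(scale=0.1,
                                       size=(len(self.word_to_class), hidden_dim))

    def word_prob(self, h, w):
        """Probability of word id w given hidden state h."""
        c = self.word_to_class[w]
        # Softmax over the classes only.
        p_class = softmax(self.class_weights @ h)[c]
        # Softmax restricted to the words inside class c.
        members = self.members[c]
        p_in_class = softmax(self.word_weights[members] @ h)
        position = int(np.flatnonzero(members == w)[0])
        return p_class * p_in_class[position]


# Toy setup: 9 words in 3 classes (an assumption, not a learned clustering).
word_to_class = [0, 0, 0, 1, 1, 2, 2, 2, 2]
model = TwoLevelSoftmax(word_to_class, hidden_dim=4)
h = np.ones(4)
total = sum(model.word_prob(h, w) for w in range(len(word_to_class)))
print(round(total, 6))  # the factorised probabilities still sum to one
```

With |V| words split into roughly √|V| classes of roughly √|V| words each, one call to `word_prob` needs on the order of 2√|V| dot products instead of |V|, which applies equally at training and at decoding time.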
In addition, some other techniques are available in the literature.

Solution: Convert into N-gram LM

Arisoy, E., Chen, S. F., Ramabhadran, B., & Sethy, A. (2014). Converting neural network language models into back-off language models for efficient decoding in automatic speech recognition. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 22(1), 184-192.

Solution: Self-normalizing NNLM

If the raw (unnormalized) output scores can be made to approximately sum to one at all times, the repeated normalization step can be skipped. Devlin et al. (2014) [Devlin, J., Zbib, R., Huang, Z., Lamar, T., Schwartz, R., & Makhoul, J. (2014). Fast and robust neural network joint models for statistical machine translation. Proceedings of the ACL, Association for Computational Linguistics, Baltimore.] introduced a regularizing term into the loss function to encourage the network to self-normalize.

See also: "When and why are log-linear models self-normalizing?"

Hyper-parameters

Vocabulary size

A typical choice for |V| is between 50,000 and 300,000 (Mikolov, 2012), although Collobert and Weston (2008) chose |V| = 30,000.
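As an illustration of the vocabulary-size choice discussed above, the following is a minimal sketch of truncating a vocabulary to its |V| most frequent words and mapping everything else to a single unknown token, in the spirit of the short-list approach. The function names, the `<unk>` token name, whitespace tokenization and the toy size of 3 are all assumptions made for the example; real systems would use sizes in the tens or hundreds of thousands.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical name for the unknown-word token


def build_vocab(tokens, max_size):
    """Map the (max_size - 1) most frequent tokens to ids; id 0 is reserved for UNK."""
    counts = Counter(tokens)
    # most_common() sorts by descending frequency; skip a literal UNK in the data.
    kept = [w for w, _ in counts.most_common() if w != UNK][:max_size - 1]
    vocab = {UNK: 0}
    for w in kept:
        vocab[w] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Replace each token by its id, falling back to the UNK id."""
    return [vocab.get(t, vocab[UNK]) for t in tokens]


def coverage(tokens, vocab):
    """Fraction of running tokens that are in the vocabulary (not mapped to UNK)."""
    in_vocab = sum(1 for t in tokens if t in vocab and t != UNK)
    return in_vocab / len(tokens)


tokens = "the cat sat on the mat the cat".split()
vocab = build_vocab(tokens, max_size=3)
print(vocab)                    # {'<unk>': 0, 'the': 1, 'cat': 2}
print(encode(tokens, vocab))    # [1, 2, 0, 0, 1, 0, 1, 2]
print(coverage(tokens, vocab))  # 0.625
```

Measuring `coverage` for several candidate sizes is one simple way to see how much of a corpus a given |V| leaves to the unknown class before committing to an output layer size.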